import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from pandas_profiling import ProfileReport
import seaborn as sns
# Build the DataFrame with pd.read_csv, since the raw data is stored as a CSV file
diamonds = pd.read_csv('../data/01_raw/diamonds.csv', delimiter=',')  # the separator is a comma
diamonds.head(10)
diamonds.tail(10)
diamonds.shape  # number of rows and columns in the dataset
diamonds.cut.unique()  # the distinct cut qualities
# Let's get an idea of how many diamonds we have of each cut
diamonds.cut.value_counts()
diamonds.isnull().sum()  # there are no NaN values
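# We don't want the index column in the numeric summary, so we convert it to strings.
# Assigning by column name replaces the column outright instead of writing into the existing integer column.
diamonds[diamonds.columns[0]] = diamonds.iloc[:, 0].astype(str)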
# Look for fully duplicated rows
duplicated_values = diamonds[diamonds.duplicated()]
print(duplicated_values)  # there are no duplicated rows (note that the index column takes part in the comparison)
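One caveat with the check above: `duplicated()` compares every column, including the row-index column read from the CSV. The sketch below uses a hypothetical three-row frame (made-up values, not taken from the file) to show how a unique index column can hide repeated measurements. Whether this matters for the real data is something to verify, not something this example proves.

```python
import pandas as pd

# Hypothetical toy rows; the values are illustrative only.
toy = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],        # unique row index, as written by to_csv
    "carat": [0.23, 0.23, 0.31],
    "price": [326, 326, 335],
})

# With the index column included, every row is unique by construction.
with_index = int(toy.duplicated().sum())

# Dropping the index column first compares only the measurements.
without_index = int(toy.drop(columns="Unnamed: 0").duplicated().sum())

print(with_index, without_index)
```

If the second count is non-zero on the real dataset, those rows may be worth inspecting before deciding whether to drop them.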
# Rename the columns to more descriptive names
diamonds1 = diamonds.rename(columns={'Unnamed: 0': 'Index', 'z': 'depth_mm', 'x': 'length_mm', 'y': 'width_mm', 'depth': 'depth_%'})
diamonds1.info()  # check the dtypes and confirm that Index has an object dtype
diamonds1.describe()  # a quick summary of the numerical features
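# Correlate the numeric columns only; text columns such as cut, color and clarity
# make DataFrame.corr() raise an error in recent pandas versions
corr = diamonds1.select_dtypes(include='number').corr()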
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f,ax = plt.subplots(figsize=(15,15))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0,
            square=True, linewidths=.6, cbar_kws={"shrink": .5})
The correlation matrix shows how our features are related. As observed, the weight of the diamond correlates strongly with its dimensions and with its price, as noted previously. Additionally, as expected, there is a high correlation among the dimensions of the diamond.
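As a rough illustration of how the correlation matrix and the triangular mask work together, the sketch below uses a hypothetical four-row frame (made-up values, not the diamonds data). Here `price` is built to be exactly linear in `carat`, and `table` is built to be uncorrelated with both.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "carat": [0.2, 0.3, 0.4, 0.5],
    "price": [200, 300, 400, 500],              # exactly linear in carat
    "table": [55.0, 58.0, 58.0, 55.0],          # constructed to be uncorrelated
    "cut": ["Ideal", "Good", "Ideal", "Good"],  # text column, excluded below
})

# Pearson correlation (the pandas default) on the numeric columns only.
corr = toy.select_dtypes(include="number").corr()

# Same mask as in the heatmap cell: True marks the diagonal and the
# upper triangle, i.e. the cells that the heatmap hides.
mask = np.triu(np.ones_like(corr, dtype=bool))

# Keep the visible lower triangle: one value per pair of features.
pairs = corr.where(~mask).stack().dropna()
print(pairs)
```

Because the matrix is symmetric, the lower triangle already contains every pair once, which is why hiding the upper half loses no information.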
sns.pairplot(diamonds1, hue="cut", diag_kind="hist")
Pair plots are a way to see the relationship between every pair of variables. The result is a grid of pairwise scatter plots, with the histogram of each variable on the diagonal.
Let's take a look at how the data is distributed as a function of the cut, which represents the quality of the diamond.
sns.displot(diamonds1, x="carat", hue="cut", kde=True)
plt.xlabel('Weight of the diamond in carats')
sns.displot(diamonds1, x="depth_%", hue="cut", kde=True)
plt.xlabel('Depth percentage')
sns.displot(diamonds1, x="table", hue="cut", kde=True)
plt.xlabel('Table percentage')
sns.displot(diamonds1, x="price", hue="cut", kde=True)
plt.xlabel('Price in $')
sns.displot(diamonds1, x="length_mm", hue="cut", kde=True)
plt.xlabel('Length in mm')
sns.displot(diamonds1, x="width_mm", hue="cut", kde=True)
plt.xlabel('Width in mm')
sns.displot(diamonds1, x="depth_mm", hue="cut", kde=True)
plt.xlabel('Depth in mm')
Box plots depict groups of numerical data through their quartiles.
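To make the quartiles concrete, here is a minimal sketch that computes the numbers behind each box for a hypothetical two-group frame (made-up prices, not the diamonds data). The box spans the 25th to the 75th percentile, and the line inside it is the median.

```python
import pandas as pd

# Hypothetical prices for two cut grades; the values are illustrative only.
toy = pd.DataFrame({
    "cut": ["Fair"] * 5 + ["Ideal"] * 5,
    "price": [300, 400, 500, 600, 700, 1000, 2000, 3000, 4000, 5000],
})

# One row per cut, one column per quantile (0.25, 0.5, 0.75).
quartiles = toy.groupby("cut")["price"].quantile([0.25, 0.5, 0.75]).unstack()

# The interquartile range is the height of the box.
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```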
sns.boxplot(x='cut', y='price', data=diamonds1)
This box plot shows that, except for the ideal cut, the lower the quality, the more outliers there are. Moreover, every cut has the same minimum and maximum price.
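The outlier counts and the price range can also be checked numerically rather than read off the plot. The sketch below assumes the usual box plot convention, which is also the seaborn default (`whis=1.5`): a point is an outlier when it lies more than 1.5 times the interquartile range beyond the box. The helper `count_outliers` and the toy prices are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical prices for two cut grades; the values are illustrative only.
toy = pd.DataFrame({
    "cut": ["Fair"] * 6 + ["Ideal"] * 6,
    "price": [300, 400, 500, 600, 700, 5000,
              1000, 2000, 3000, 4000, 5000, 6000],
})


def count_outliers(prices: pd.Series, whis: float = 1.5) -> int:
    """Count the values lying beyond whis * IQR from the quartiles."""
    q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
    iqr = q3 - q1
    outside = (prices < q1 - whis * iqr) | (prices > q3 + whis * iqr)
    return int(outside.sum())


# Minimum, maximum and number of outliers for each cut.
summary = toy.groupby("cut")["price"].agg(
    lowest="min", highest="max", outliers=count_outliers
)
print(summary)
```

Applied to `diamonds1`, the same aggregation would give a table against which the statements about outliers and price ranges can be compared.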
sns.boxplot(x='clarity', y='price', data=diamonds1)
No trend can be drawn from this graph; however, every clarity grade has the same minimum and maximum price.
sns.boxplot(x='color', y='price', data=diamonds1)
This box plot shows that, except for G, the better the color, the more outliers there are. Moreover, every color grade has the same minimum and maximum price.
cut_clarity_table = pd.crosstab(index=diamonds1["cut"], columns=diamonds1["clarity"])
cut_clarity_table.plot(kind="bar",
                       figsize=(12, 12),
                       stacked=True)
It can be observed that the quality of the cut is the parameter that drives people to buy a diamond, regardless of its clarity.
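The stacked bars show raw counts, so the most common cut dominates the picture. As a possible complement, `pd.crosstab` accepts `normalize="index"`, which turns each row into shares that sum to 1 and makes the clarity mix comparable across cuts. The sketch below uses a hypothetical six-row frame (made-up values, not the diamonds data).

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Ideal", "Ideal", "Fair", "Fair"],
    "clarity": ["SI1", "SI1", "SI1", "VS2", "SI1", "VS2"],
})

# Raw counts, as plotted in the stacked bar chart.
counts = pd.crosstab(index=toy["cut"], columns=toy["clarity"])

# Row-wise shares: the fraction of each clarity within every cut.
shares = pd.crosstab(index=toy["cut"], columns=toy["clarity"], normalize="index")

print(counts)
print(shares)
```

Plotting `shares` with `kind="bar", stacked=True` would give bars of equal height, so only the proportions differ between cuts.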
color_clarity_table = pd.crosstab(index=diamonds1["color"], columns=diamonds1["clarity"])
color_clarity_table.plot(kind="bar",
                         figsize=(12, 12),
                         stacked=True)
People prefer the G and E colors over the others, and the most sought-after clarities are VS2 and SI1.
clarity_cut_frame = pd.crosstab(index=diamonds1["clarity"], columns=diamonds1["cut"])
clarity_cut_frame.plot(kind="bar",
                       figsize=(12, 12),
                       stacked=True)
The most purchased diamonds are those with SI1 and VS2 clarity and an ideal cut, followed closely by the premium cut. Nonetheless, it is observed that people prefer to choose by color rather than by clarity.
profile = ProfileReport(diamonds1,title='EDA of Diamonds Dataset')
profile